sketch out sync codecs + threadpool #3715
d-v-b wants to merge 31 commits into zarr-developers:main from
## Conversation
docs/design/sync-bypass.md (Outdated)

```diff
@@ -0,0 +1,228 @@
+# Design: Fully Synchronous Read/Write Bypass
```
performance impact ranges from "good" to "amazing", so I think we want to learn from this PR. IMO this is NOT a merge candidate, but rather should function as a proof-of-concept for what we can get if we rethink our current codec API. Some key points:
|
|
the current performance improvements are without any parallelism. I'm adding that now.
|
the latest commit adds thread-based parallelism to the synchronous codec pipeline. we compute an estimated compute cost based on the chunk size, codecs, and operation (encode / decode), and use that estimate to choose a parallelism strategy, ranging from no threads to full use of a thread pool.
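A minimal sketch of what cost-based worker selection could look like. All names (`estimate_cost`, `choose_workers`, the threshold and worker-cap constants) are illustrative, not the PR's actual API:

```python
# Hypothetical cost model: total bytes pushed through all codecs is a rough
# proxy for compute work. Constants here are illustrative placeholders.
_MAX_WORKERS = 8
_COST_PER_WORKER = 1_000_000  # estimated bytes of codec work per thread


def estimate_cost(n_chunks: int, chunk_nbytes: int, n_codecs: int) -> int:
    """Rough proxy for total codec work: bytes processed across all codecs."""
    return n_chunks * chunk_nbytes * max(n_codecs, 1)


def choose_workers(n_chunks: int, chunk_nbytes: int, n_codecs: int) -> int:
    """Map estimated cost to a worker count, from 0 (run inline) to a full pool."""
    cost = estimate_cost(n_chunks, chunk_nbytes, n_codecs)
    if cost < _COST_PER_WORKER:
        return 0  # too little work to pay for thread dispatch overhead
    return min(n_chunks, _MAX_WORKERS, cost // _COST_PER_WORKER)
```

The key design point, per the comment above, is the graded response: tiny workloads stay inline, and the pool only ramps up when the estimated work justifies it.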
|
marking this as not a draft, because I think we should actually consider merging it.
mkitti left a comment
Could we adjust work estimates based on codec parameters?
```python
_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB


def _choose_workers(n_chunks: int, chunk_nbytes: int, codecs: Iterable[Codec]) -> int:
```
Can this be `def _use_thread_pool(...) -> bool` instead?
```diff
-def _get_pool(max_workers: int) -> ThreadPoolExecutor:
-    """Get a thread pool with at most *max_workers* threads."""
+def _get_pool() -> ThreadPoolExecutor:
```
hard to see why this had to change, but... I'm not opposed to it.
|
The changes here improve performance a lot, but I think we can do even better with a more comprehensive set of changes. I had Claude cook up a planning document based on zarrs and tensorstore here: https://hackmd.io/A933wEUwQjOx8rJmmWo13A. Please review that document. I will use this plan to guide the next round of performance improvements. Not sure if they will be in this PR or a subsequent one.
```diff
 assert get_pipeline_class().__name__ != ""

-config.set({"codec_pipeline.name": "zarr.core.codec_pipeline.BatchedCodecPipeline"})
+config.set({"codec_pipeline.path": "zarr.core.codec_pipeline.BatchedCodecPipeline"})
```
```python
# _open() from a sync context, so we replicate its logic here.
# -------------------------------------------------------------------


def get_sync(
```
are we able to share sync/async code paths at all?
```python
# Minimum chunk size (in bytes) to consider using the thread pool.
# Below this, per-chunk codec work is too small to offset dispatch overhead.
_MIN_CHUNK_NBYTES_FOR_POOL = 100_000  # 100 KB
```
let's make this a config, so it's easy to experiment with
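A sketch of the config-driven threshold being suggested. Zarr's real config is donfig-based; this dict-backed stand-in (key name and helpers are hypothetical) just shows the lookup pattern:

```python
# Hypothetical config-backed threshold: read at call time so experiments can
# tweak it at runtime without a restart. Key name is illustrative.
_DEFAULTS = {"threading.min_chunk_nbytes_for_pool": 100_000}
_config: dict = dict(_DEFAULTS)


def set_config(key: str, value: int) -> None:
    _config[key] = value


def min_chunk_nbytes_for_pool() -> int:
    """Look up the pool threshold from config instead of a module constant."""
    return _config["threading.min_chunk_nbytes_for_pool"]
```

Reading the value inside the function (rather than caching it at import time) is what makes `config.set(...)`-style experimentation work.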
```python
if self._all_sync:
    # Streaming per-chunk pipeline: each chunk flows through
    # read_existing → decode → merge → encode → write as a single
    # task. Running N tasks concurrently overlaps IO with compute.
    async def _write_chunk(
        byte_setter: ByteSetter,
```
why is there an async func under `self._all_sync`? Seems like a naming issue that is very confusing to me right now
we send this function into `concurrent_map`, which is why it needs to be async
This is a work in progress with all the heavy lifting done by Claude. The goal is to improve the performance of our codecs by avoiding overhead in `to_thread` and other async machinery. At the moment we have deadlocks in some of the array tests, but I am opening this now as a draft to see if the benchmarks show anything promising.